Frontiers in Bioinformatics
Frontiers Media SA
Preprints posted in the last 90 days, ranked by how well they match the content profile of Frontiers in Bioinformatics, based on 45 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Jabin, A.; Ahmad, S.
Molecular profiling of tumours via RNA sequencing (RNA-seq) enables clinically actionable stratification but remains costly, tissue-intensive, and time-consuming. Recent advances in computational pathology suggest that routine H&E whole-slide images (WSIs) can be used to estimate the transcriptomic states of cancer cells. Because WSI-derived predictions of transcriptional signatures are noisy, however, their accurate biological interpretation is challenging. Pathway enrichment analysis, on the other hand, is routinely used to derive biologically meaningful cellular states from noisy gene expression data, and some studies have evaluated how well WSI-predicted gene expression profiles reconstruct enriched pathways in experiments where the two data modalities were concurrently available. It remains unclear, though, whether a model designed to predict enriched pathways directly from WSIs would outperform the current approach of first predicting gene expression. Here, we develop and evaluate these two complementary approaches for predicting pathway enrichment profiles from WSIs in TCGA Breast Invasive Carcinoma (TCGA-BRCA), training parallel models that predict pathway enrichment directly from image features alongside models that rely on predicted gene expression profiles, the current state of the art. Our results suggest that, under controlled experiments, direct prediction of a selected pool of enriched pathways outperforms models trained to predict gene expression followed by enrichment inference on the predicted values. These findings will help prioritize the goals of predictive modeling of WSIs and improve diagnostic outcomes for cancer patients.
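As an illustrative aside to the enrichment-versus-expression comparison above: a minimal, toy version of rank-based single-sample pathway scoring (the general family that enrichment-on-predicted-expression pipelines draw on) can be sketched as follows. The gene names, gene sets, and scoring rule are hypothetical, not the authors' method.

```python
def pathway_score(expression, pathway_genes):
    """Mean rank of a pathway's genes, rescaled to [-1, 1].

    Positive values mean the set sits among highly expressed genes;
    a crude, noise-tolerant stand-in for full enrichment statistics.
    """
    ranked = sorted(expression, key=expression.get)      # low -> high
    rank = {gene: i for i, gene in enumerate(ranked)}
    in_set = [rank[g] for g in pathway_genes if g in rank]
    if not in_set:
        return 0.0
    mid = (len(ranked) - 1) / 2
    return (sum(in_set) / len(in_set) - mid) / mid

# Toy predicted-expression profile: GENE0 lowest, GENE9 highest.
expr = {"GENE%d" % i: float(i) for i in range(10)}
score_up = pathway_score(expr, {"GENE7", "GENE8", "GENE9"})
score_down = pathway_score(expr, {"GENE0", "GENE1", "GENE2"})
```

Because the score depends only on ranks, it is somewhat robust to the kind of noise that WSI-predicted expression profiles carry, which is part of the motivation for comparing enrichment-level and gene-level prediction targets.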
Reddy, T.; Schneider, A.; Hall, A. R.; Witmer, A.; Hengartner, N.
There have been several attempts to develop machine learning (ML) models that identify human-infecting viruses from their genomic sequences, with varying degrees of success. Direct comparison between models is problematic because these models are typically trained and evaluated on different datasets with alternative data splitting schemes, features, and model performance metrics. In this paper we present a standardized dataset of mammal-infecting and non-infecting viral pathogens, refined from the previous work of Mollentze et al. to include the latest literature evidence, roughly doubling the number of curated host-virus records available to the community, and adding new host target labels, primate and mammal. The new host labels were included for several reasons: previous reports that classification performance is better at broader taxonomic ranks; the possibility that the larger amount of data for primate infection may serve as a suitable proxy for zoonotic potential; and the avoidance of false positives for human infection due to absence of evidence. On this dataset, we report the performance of eight machine learning models for predicting mammal-infecting viruses from their genomic sequences. We find that randomly assigning cases in our improved dataset to training/testing sets, compared to the original training/testing assignments in Mollentze et al., increases the overall average ROC AUC for prediction of human infection from 0.663 ± 0.070 to 0.784 ± 0.013, consistent with the reduction in phylogenetic distance between train and test sets (relative entropy change from 3.00 to 0.08). The broadest host category, mammal infection, can be predicted most reliably, at 0.850 ± 0.020. We share our improved dataset and code to enable standardized comparisons of machine learning methods for predicting human host infections.
Overall, we have presented preliminary evidence that classification of virus host infection is more tractable at higher taxonomic ranks, that, unsurprisingly, reducing the phylogenetic distance between training and test sets can improve predictive performance, and that peptide kmer features appear to be harmful to out-of-sample model performance; we are left with the question of whether models for virus host prediction can reasonably be expected to perform well in out-of-sample scenarios given the likelihood that viruses do not share a common ancestor. Consistent with this concern, when the data is resampled such that there is no overlap between viral families in training and test sets (relative entropy > 24), models perform no better than random chance at predicting human infection, regardless of whether kmers are included (ROC AUC 0.50 ± 0.08) or not (ROC AUC 0.50 ± 0.04). Author Summary: Determining whether a virus can infect a human or other animal based on its genetic information is useful for assessing the threat level of circulating and newly emerging viruses. Previous studies in this domain have had access to limited datasets; in this work we nearly double the amount of manually labelled host data for viral infection, so that others may build on it and improve it further. We use machine learning models to rank the likelihood of human and mammal infection for viruses in this improved dataset. Results are consistent with the determination of host infection being more tractable for broader categories of hosts, like mammals, than for specific species, like humans. This may suggest good prospects for improved future models that first screen viruses based on their likelihood of infecting mammals, and then, in a second stage, for likelihood of human infection.
The most challenging scenarios were for predictions of viruses that were not similar to viruses in the training data, and the question remains whether we can expect reasonable generalization of predictive models to completely new viruses given that, at the time of writing, viruses do not appear to share a common ancestor.
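For readers unfamiliar with the k-mer features discussed above: a genomic sequence is typically converted into a normalized vector of k-mer frequencies before being fed to a classifier. A minimal sketch (the sequence and k are arbitrary; this is not the paper's feature pipeline):

```python
from collections import Counter

def kmer_profile(seq, k=3):
    """Normalized k-mer frequency profile of one nucleotide sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

profile = kmer_profile("ATGATGATGC", k=3)   # 8 overlapping 3-mers
```

Each sequence thus becomes a point in a fixed-dimensional frequency space, regardless of genome length, which is what makes k-mer profiles convenient inputs for the models compared in the paper.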
Schuiveling, M.; Liu, H.; Eek, D.; Hanusov, M.; van Duin, I.; ter Maat, L. S.; van der Weerd, J. C.; van den Berkmortel, F. W. P. J.; Blank, C. U.; Breimer, G. E.; Burgers, F. H.; Boers-Sonderen, M.; van den Eertwegh, A. J. M.; de Groot, J. W.; Haanen, J. B. A. G.; Hospers, G. A. P.; Kapiteijn, E.; Piersma, D.; Simkens, L. H. J.; Westgeest, H. M.; Schrader, A. M. R.; van Diest, P. J.; Lv, J.; Zhu, Y.; Tenorio, C. G. C.; Chohan, B. S.; Eastwood, M.; Raza, S. E. A.; Torbati, N.; Meshcheryakova, A.; Mechtcheriakova, D.; Mahbod, A.; Adams, D.; Galdran, A.; Pluim, J. P. W.; Blokx, W. A. M.; Suijker
Patients with advanced melanoma are treated with immune checkpoint inhibitors (ICIs), yet less than 50% of patients achieve a durable response while all patients are exposed to the risk of severe side effects. Tumor-infiltrating lymphocytes (TILs) in pathology images are associated with ICI outcomes, but manual assessment is subjective. In addition, the predictive value of other immune cell subsets, including plasma cells, neutrophils, histiocytes, and melanophages, remains unclear. We organized the Panoptic segmentation of nUclei and tissue in advanced MelanomA (PUMA) challenge to evaluate whether the spatial localization of TILs and other immune cell subsets on melanoma H&E slides collected before the start of treatment was associated with treatment outcomes. Algorithm performance was evaluated on a hidden test set, after which top-ranked algorithms were applied to pre-treatment metastatic whole-slide images from a large, multicenter cohort of patients with advanced melanoma treated with first-line ICIs (n=1102). Automatically quantified tissue features and immune cell subsets were then associated with clinical outcomes. Top-performing algorithms improved detection of immune cell subsets, although accuracy for rare classes remained limited. Across challenge participants, TIL density showed the most consistent association with treatment response and survival. Associations for stromal TILs were weaker, while plasma cells, histiocytes, melanophages, neutrophils, necrosis, and blood vessels did not show independent associations with outcomes. Overall, the results from the PUMA challenge improved the state of the art of immune cell detection in melanoma histopathology and show that intra-tumoral lymphocytes are the immune cell subset most consistently associated with treatment response and survival.
Highlights
- We organized the first melanoma-specific tissue and nuclei segmentation competition
- Winning algorithms were applied to 1102 whole-slide images for biomarker analysis
- Intra-tumoral TILs were associated with response to immune checkpoint inhibitors
- Other immune cell subsets showed no independent association with treatment outcomes
- Tissue segmentation on WSIs was limited by low heterogeneity in training data
Graphical abstract: Figure 1.
Kilim, O.; Martinez Ruiz, C.; Pipek, O.; Sztupinszki, Z.; Huebner, A.; Diossy, M.; Prosz, A.; Moore, D.; Jamal-Hanjani, M.; Hackshaw, A.; Fillinger, J.; Moldvay, J.; Csabai, I.; Swanton, C.; Szallasi, Z.
The standard treatment for stage I lung adenocarcinoma is surgical resection, in most cases without additional systemic adjuvant treatment. A significant proportion of stage I cases recur, with a less than 50% 5-year survival rate. Clinical data suggest that adjuvant treatment may improve survival in such recurrent cases. However, previously evaluated predictors, such as the IASLC grading system from histological sections and transcriptomic profiles, have not been sufficiently accurate and consistent for risk stratification and for guiding therapeutic interventions. We hypothesized that these previously investigated diverse diagnostic measurements carry complementary information that may provide higher prognostic power when combined. Here we describe a multimodal deep learning method, PATH-ORACLE. This biomarker is built on top of the prospectively validated transcriptomic-based ORACLE score with the addition of routine histological sections processed by pre-trained foundation models. PATH-ORACLE predicts recurrence with an accuracy of over 85% in two independent cohorts. Given further validation, this predictor could be used to prioritize stage IB patients for adjuvant chemotherapy in a more consistent fashion. Furthermore, for stage IA cases, PATH-ORACLE combined with liquid biopsy-based monitoring may help identify high-risk patients suitable for adjuvant targeted therapy.
Highlights
- Multimodal AI model (PATH-ORACLE) integrates histology and transcriptomics to predict stage I LUAD recurrence
- PATH-ORACLE outperforms IASLC grading and transcriptomic or image-based models alone
- Model achieves >85% recurrence prediction accuracy across independent international cohorts
- PATH-ORACLE refines risk stratification within both stage IA and IB lung adenocarcinoma
- Biomarker may guide adjuvant therapy selection and surveillance in early-stage disease
Guler, F.; Goksuluk, D.; Xu, M.; Choudhary, G.; agraz, m.
Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The engineering application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic datasets generated from Naive Bayes models, where MLP-based augmentation yielded a notable improvement in predictive performance. Building on this foundation, we applied the approach to chromophobe renal cell carcinoma (KICH) RNA-Seq data from The Cancer Genome Atlas (TCGA). Following standard preprocessing steps (normalization, transformation, and dimensionality reduction), the analysis concentrated on three main aspects: augmentation strategies, preprocessing methods, and explainable AI (XAI) techniques in relation to classification outcomes. Feature selection was performed through PCA, Boruta, and RF-based methods. Three augmentation strategies (linear interpolation, SMOTE, and MixUp) were evaluated. To maintain methodological rigor, augmentation was applied exclusively to the training set, while the test set was held out for unbiased evaluation. Within this framework, we conducted a comparative assessment of multiple deep learning architectures, including MLP, GNN, and the recently proposed Kolmogorov-Arnold networks (KAN). The GNN achieved the highest classification accuracy (99.47%) when trained with MixUp augmentation combined with RF feature selection, along with the best F1 score (0.9948). Consequently, the GNN-based XAI framework was applied to the RF dataset enriched with MixUp.
XAI analyses identified the top 20 most influential genes, such as HNF4A, DACH2, MAPK15, and NAT2, which played the greatest role in classification, thereby confirming the biological plausibility of the model outputs. To further validate model robustness, cervical cancer and Alzheimer's RNA-Seq datasets were also tested, yielding consistent and reliable results. Overall, the findings highlight the value of incorporating data augmentation into deep learning models for RNA-Seq analysis, not only to improve predictive performance but also to enhance biological interpretability through explainable AI approaches.
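Of the augmentation strategies compared above, MixUp is the simplest to state: each synthetic sample is a convex combination of two real samples and their labels, with the mixing weight drawn from a Beta distribution. A self-contained sketch (the feature vectors and alpha value are illustrative, not the study's settings):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with a Beta-drawn weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

rng = random.Random(0)                     # fixed seed for reproducibility
x, y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
```

As the abstract stresses, such augmentation belongs on the training split only; mixing anything involving test samples would leak information into evaluation.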
Fletcher, W. L.; Sinha, S.
The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance on diverse right-censored time-to-event data (aka survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, to compare primarily their variable selection capability and secondarily their survival time prediction on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well on all metrics, and the LASSO and elastic net excelled on concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. Some methods' performances were greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers choose the best approach for their needs when working with genomic data.
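One of the evaluation metrics above, the concordance index, rewards models whose risk ordering matches the observed event ordering. A bare-bones version for right-censored data (toy inputs; production implementations also handle ties in event times):

```python
def concordance_index(times, events, risk):
    """C-index: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] == 1) strictly before time j; higher risk should fail earlier.
    """
    concordant = comparable = 0.0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5      # ties in risk count half
    return concordant / comparable

# Risk ordering perfectly matches event ordering -> C-index of 1.0.
cindex = concordance_index(times=[2, 4, 6, 8], events=[1, 1, 0, 1],
                           risk=[0.9, 0.7, 0.3, 0.1])
```

A value of 0.5 corresponds to random risk ordering, which is why it serves as the baseline in benchmarks like this one.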
Bolut, C.; Pacary, A.; Pieruccioni, L.; Ousset, M.; Paupert, J.; Casteilla, L.; Simoncini, D.
Machine learning (ML) models are effective at classifying images across various fields, including biology. However, their performance on biomedical images is often limited by the small size of available datasets, which are constrained by the time-consuming and costly nature of experimental data collection. A review of the literature shows that many studies using biomedical images fail to follow ML best practices. This study focuses on regenerative medicine, which aims to promote tissue regeneration rather than scarring. To explore this process, we applied ML to a limited dataset of images of mouse tissues, aiming to distinguish between regenerating and scarring samples. As expected, binary classification failed to generalize to independent data. A novel SHAP-based analysis revealed that the overfitting models were based on spurious correlations, including individual mouse characteristics that aligned with the regeneration/scarring labels. The models appeared to be solving the binary classification task but were in fact recognizing individuals. To investigate this behavior further, we examined the test set confusion matrix of a model trained to identify individual mice. We observed that, beyond individual recognition, individuals were grouped according to the time elapsed after injury (day 3 or 10) and the healing outcome (regeneration or scarring). We hypothesized that these groupings were based on relevant biological information captured by the model. To test this hypothesis, we successfully trained a model to classify images according to the time elapsed after injury (3 or 10 days), demonstrating that ML can extract relevant biological information when the task is aligned with what the data can actually support. Altogether, this study demonstrates that carefully examining a model's explanations is an effective way not only to unveil putative biases but also to extract relevant information from a limited dataset.
Author Summary: Machine learning is increasingly used to analyze biomedical images, but in many experimental settings only small datasets are available, which can easily mislead powerful models. In this study, we looked at images from mouse tissues, with the goal of distinguishing healing by regeneration from healing by scarring. Although standard machine learning models appeared to perform well during training, they failed to generalize to new animals. By carefully analyzing model explanations, we found that the models were not learning biologically meaningful patterns of tissue repair but instead were recognizing individual mice based on subtle image-specific signatures. Importantly, this same analysis revealed that the models did capture relevant biological information when the task was better aligned with the data, such as distinguishing early versus late stages of healing. Our results highlight how explanation methods can uncover hidden biases, prevent false conclusions, and help researchers extract meaningful biological insights even from limited and imperfect datasets.
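The leakage failure described above (models recognizing individual mice rather than biology) is commonly guarded against by splitting data at the level of the individual, so no animal contributes images to both training and test sets. A stdlib-only sketch of such a group-aware split; the sample/group layout is invented for illustration:

```python
import random

def group_split(samples, groups, test_frac=0.25, seed=0):
    """Hold out whole groups (e.g. individual animals), not single images."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_test = max(1, int(len(unique_groups) * test_frac))
    test_groups = set(unique_groups[:n_test])
    train = [s for s, g in zip(samples, groups) if g not in test_groups]
    test = [s for s, g in zip(samples, groups) if g in test_groups]
    return train, test, test_groups

samples = list(range(12))                 # 12 images
groups = [i // 3 for i in range(12)]      # 4 mice, 3 images each
train, test, held_out = group_split(samples, groups)
```

Splitting by animal rather than by image is exactly what makes individual-recognition shortcuts visible as a generalization failure instead of an inflated accuracy.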
Richardson, E.; Aarts, Y. J. M.; Altin, J. A.; Baakman, C. A. B.; Bradley, P.; Chen, B.; Clifford, J.; Dhar, M.; Diepenbroek, D.; Fast, E.; Gowthaman, R.; He, J.; Karnaukhov, V.; Marzella, D. F.; Meysman, P.; Nielsen, M.; Nilsson, J. B.; Deleuran, S. N.; Parizi, F. M.; Pelissier, A.; Pierce, B. G.; Rodriguez Martinez, M.; Roran A R, D.; Saravanakumar, S.; Shao, Y.; Smit, N.; Van Houcke, M.; Visani, G. M.; Wan, Y.-T. R.; Wang, X.; Woods, L.; Wuyts, S.; Xiao, C.; Xue, L. C.; IMMREP25 Participant Consortium, ; Barton, J.; Noakes, M.; May, D. H.; Peters, B.
T cell receptors (TCRs) can bind to peptides presented by MHC molecules (pMHC) as a first step to trigger a T cell response. Reliable approaches to predict TCR:pMHC binding would have broad applications in clinical diagnostics, therapeutics, and the fundamental understanding of molecular interactions. IMMREP is a community-organized series of prediction contests that asks participants to predict TCR:pMHC binding on unpublished datasets. Previous iterations in 2022 and 2023 showed that multiple approaches can predict TCR:pMHC binding with significant accuracy (median AUC_0.1 ≥ 0.7) for peptides where experimental data are available ("seen" peptides). In contrast, models did not outperform random guessing for peptides that have no such data available ("unseen" peptides). Here we report on the results of IMMREP25, which focused solely on unseen peptides in order to evaluate the cutting edge of the field. We received 126 named submissions predicting the specificity of 1,000 TCRs against twenty unseen peptides restricted by one of two MHC molecules (HLA-A*02:01 and HLA-B*40:01). The best performing methods showed a macro-AUC_0.1 of 0.60, significantly better than random, demonstrating significant advances in the field. The top performing methods incorporated structural modeling into their approach, indicating that, especially for unseen peptides, a structural understanding aids in the prediction of TCR:pMHC interactions. The results from this benchmark highlight the significant challenges remaining for TCR:pMHC predictions and will inform future method development.
Gainullin, V. G.; Gray, M.; Kumar, M.; Luebker, S.; Lehman, A. M.; Choudhry, O. A.; Roberta, J.; Flake, D. D.; Shanmugam, A.; Cortes, K.; Chang, E.; Uren, P. J.; Mazloom, A.; Garces, J.; Silvestri, G. A.; Chesla, D. W.; Given, R. W.; Beer, T. M.; Diehl, F.
Multi-cancer early detection (MCED) tests can detect several cancer types and stages. We previously developed a methylation and protein (MP V1) MCED classifier. In this study, we present a refined MP V2 classifier, developed by evaluating model architectures that improved performance in prospectively enrolled case-control cohorts under standard testing conditions. The newly developed MP V2 classifier was trained to be more generalizable and achieve increased early-stage sensitivity at a target specificity of ≥97.0%. MP V1 and MP V2 classifier performances were compared using a previously described test set, and MP V2 performance was also evaluated in a new independent clinical validation set. Compared to MP V1, the MP V2 classifier demonstrated a 7.3% increase in overall sensitivity, with sensitivity increases of 7.6%, 9.2%, and 8.3% for stages I, II, and stages I/II, respectively, in the intended use (breast and prostate cancers excluded) test set. In an independent validation intended use set, the MP V2 classifier showed an overall sensitivity of 55.6%, with sensitivities of 26.8%, 42.9%, and 34.8% for stages I, II, and stages I/II, respectively. In a case-control setting, the MP V2 classifier offered improved sensitivity for early-stage cancers at a lower specificity target.
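The operating point quoted above (sensitivity at a target specificity of ≥97.0%) can be computed by thresholding classifier scores at the appropriate quantile of the non-cancer class. A toy sketch with invented score lists, not the classifier's actual outputs:

```python
import math

def sensitivity_at_specificity(scores_pos, scores_neg, target_spec=0.97):
    """Threshold at the target-specificity quantile of the negatives,
    then report sensitivity and achieved specificity at that cutoff."""
    neg_sorted = sorted(scores_neg)
    k = math.ceil(target_spec * len(neg_sorted)) - 1
    threshold = neg_sorted[k]
    sens = sum(s > threshold for s in scores_pos) / len(scores_pos)
    spec = sum(s <= threshold for s in scores_neg) / len(scores_neg)
    return sens, spec

# 100 invented control scores spread over [0, 1); 5 invented cancer scores.
neg = [i / 100 for i in range(100)]
pos = [0.5, 0.95, 0.97, 0.99, 1.0]
sens, spec = sensitivity_at_specificity(pos, neg)
```

Fixing specificity first and reading off sensitivity is the standard way MCED papers report early-stage performance, since false positives are the binding constraint in screening.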
Reinosa, R.
Introduction: The precise determination of diagnostic cut-off points is essential for the development of multimarker panels in oncology. In previous work on pulmonary nodules, it was observed that the standard two-parameter logistic fit could be insufficient for biomarkers with asymmetric distributions. Furthermore, the calculation of empirical cut-off points based on graphical visualization presented limitations in precision and reproducibility. Objective: This study presents a methodological advancement in the data analysis phase (Stage 1), introducing new Python algorithms for the direct analytical calculation of empirical intersections and robust mathematical modeling using Dual Annealing with both two-parameter and four-parameter logistic functions. This improved methodology feeds into the ThresholdXpert 1.0 software tool for combinatorial optimization of biomarker panels (Stage 2), and is applied here to the diagnostic challenge of hepatocellular carcinoma (HCC). Methods: The methodology was first validated by re-analyzing a dataset of patients with pulmonary nodules (N=895). It was subsequently applied to an HCC dataset derived from the cohort of Jang et al. (208 HCC, 193 cirrhosis, 401 total), randomly divided into a training set (280) and an independent test set (121). Scripts were developed to compare the previous two-parameter logistic fit with the new two- and four-parameter logistic models. Finally, ThresholdXpert 1.0 was used for multimarker panel optimization. Results: The integration of empirical calculation, logistic modeling, and combinatorial optimization through ThresholdXpert 1.0 provides a robust and coherent framework for the development of multimarker diagnostic panels. The four-parameter logistic model provided additional validation without substantially modifying cut-off values for most biomarkers, confirming the stability of the approach while offering greater flexibility for complex distributions.
When applied to hepatocellular carcinoma, the framework identified a molecular panel composed of AFP, PIVKA-II, OPN, and DKK-1 with sensitivity of 0.77 and specificity of 0.72, and an optimized panel incorporating inverse MELD that achieved the best overall balance (sensitivity 0.73, specificity 0.75) in independent external validation. These results demonstrate the potential of this approach as a generalizable tool for the optimized design of binary diagnostic systems in oncology. Conclusion: The integration of complementary mathematical modeling enhances the capability of ThresholdXpert 1.0 to identify robust diagnostic panels, as in some cases a single biomarker may outperform biomarker combinations, and vice versa. This approach enabled the integration of molecular biomarkers and clinical variables under a unified mathematical framework. Contact: roberto117343@gmail.com
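As a small illustration of the four-parameter logistic (4PL) model discussed above: it extends the two-parameter fit with free lower and upper asymptotes, which is what gives it flexibility for asymmetric biomarker distributions. The parameter values below are arbitrary; this is the generic 4PL form, not the study's fitted curves:

```python
import math

def logistic4(x, bottom, top, x50, slope):
    """Four-parameter logistic curve; collapses to the two-parameter
    form when bottom = 0 and top = 1."""
    return bottom + (top - bottom) / (1.0 + math.exp(-slope * (x - x50)))

mid = logistic4(2.0, bottom=0.1, top=0.9, x50=2.0, slope=3.0)    # midpoint
low = logistic4(-100.0, bottom=0.1, top=0.9, x50=2.0, slope=3.0)
high = logistic4(100.0, bottom=0.1, top=0.9, x50=2.0, slope=3.0)
```

In a full pipeline, these four parameters would be fitted per biomarker (the study uses Dual Annealing for this) before empirical intersections and cut-offs are computed.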
Cannon, M. V.; Gust, M. J.; Gross, A. C.; Cam, M.; Reinecke, J. B.; Jimenez Garcia, L.; Strawser, C. H.; Ryan, L.; Sammons, M.; Zhang, C.-Z.; Roberts, R. D.
Motivation: Single cell RNAseq (scRNAseq) is an ideal tool to characterize the heterogeneity within the tumor microenvironment; however, accurate identification of tumor cells can be a challenge. Reference-based methods can be inaccurate, if reference datasets are even available, and current purpose-built methods can also be inaccurate, particularly with highly heterogeneous tumor types. Improved methods are needed. We explored the use of genetic variants to distinguish tumor from normal cells within scRNAseq data. Results: We characterized the limitations inherent in calling variants from scRNAseq data, quantifying how data sparsity precludes genetic distance calculation between single cells. As a novel workaround, we pooled data from transcriptionally similar cell clusters to call high-quality variants, then calculated pairwise differences between cell populations and performed hierarchical clustering. We quantified confidence in genetic divergence between tumor and normal cell populations using bootstrapping. We performed extensive validation to assess accurate identification of tumor cells using ground-truth datasets. Application of our method to human scRNAseq samples highlighted the utility of our approach and revealed how mutational burden influences successful tumor cell identification. Improved cell type assignment in scRNAseq data will facilitate analysis of tumor samples and, in turn, accelerate our understanding of the mechanisms underlying tumor progression and reveal potential biological vulnerabilities that can be exploited to develop improved treatment options. Availability and implementation: Our method is publicly available as an R package, SCANBIT (Single Cell Altered Nucleotide Based Inference of Tumor): https://github.com/kidcancerlab/scanBit.
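The clustering step above rests on a pairwise genetic distance between pooled (pseudobulk) cell populations. A deliberately simplified version, comparing cluster-level variant calls encoded as strings; the genotype encoding and cluster names are invented, and the actual SCANBIT workflow operates on real variant calls with bootstrapped confidence estimates:

```python
def variant_distance(calls_a, calls_b):
    """Fraction of co-covered variant sites where two pseudobulk
    profiles disagree; sites with a missing call ('.') are skipped."""
    shared = [(a, b) for a, b in zip(calls_a, calls_b) if '.' not in (a, b)]
    if not shared:
        return None                       # no co-covered sites
    return sum(a != b for a, b in shared) / len(shared)

# Toy pseudobulk genotypes for three cell clusters over 8 sites.
clusters = {
    "tumor_1": "AAGGTTCC",
    "tumor_2": "AAGGTTCA",
    "normal":  "AACCTTGG",
}
d_tumor = variant_distance(clusters["tumor_1"], clusters["tumor_2"])
d_cross = variant_distance(clusters["tumor_1"], clusters["normal"])
```

Tumor subclones end up genetically closer to each other than to normal cells, which is the separation that hierarchical clustering on such a distance matrix exploits.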
Nguyen, D. H.
Numerous studies have shown that the morphological phenotype of a cell or organoid correlates with its susceptibility to anti-cancer agents. However, traditional methods of measuring phenotype rely on spatial metrics such as area, volume, perimeter, and signal intensity, which work but are limited. These approaches cannot measure many crucial features of spatial context, such as chirality, which is the property of having left- and right-handedness. Volume cannot register chirality because a left shoe and a right shoe harbor the same amount of volume. Though spatial context in the form of chirality, the direction of gravity, and the axis of polarity are intuitive notions to humans, the traditional metrics relied on by cell biologists, pathologists, radiologists, and machine learning scientists up to this point cannot register these fundamental notions. The Linearized Compressed Polar Coordinates (LCPC) Transform is a novel algorithm that can capture spatial context unlike any other metric. The LCPC Transform translates a two-dimensional (2D) contour into a discrete sinusoid wave by overlaying a grid system that tracks points of intersection between the contour and the grid lines. It turns the contour into a series of sequential pairs of discrete coordinates, with the independent coordinate (x-coordinate) being consecutive positions in 2D space. Each dependent coordinate (y-coordinate) is the distance from an intersection of the contour with a gridline to the origin of the grid system. With the contour in the form of a discrete sinusoid wave, the Fast Fourier Transform is then applied to the data. In this way, the shape of cells in 2D and 3D cell culture is represented systematically and multidimensionally, allowing for robust quantitative stratification that will reveal insights into treatment resistance.
Summary: This article explains how novel features of morphology in cells and organoids can be measured by the Linearized Compressed Polar Coordinates (LCPC) Transform, a spatial algorithm that measures what traditional metrics, such as area, volume, and surface area, cannot. Best practices for shape orientation and alignment are discussed.
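A flavor of the general idea (contour → ordered distance signal → Fourier spectrum) can be given in a few lines. This radial signature is a simplification for illustration only; the actual LCPC Transform uses grid-line intersections rather than polar sampling, and a naive DFT stands in for the FFT:

```python
import cmath
import math

def radial_signature(contour, center):
    """Distance from each ordered contour point to a fixed center."""
    cx, cy = center
    return [math.hypot(x - cx, y - cy) for x, y in contour]

def dft_magnitudes(signal):
    """Normalized magnitude spectrum of a discrete signal (naive DFT)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n for k in range(n)]

# A circle's radial signature is constant: all energy in the DC term.
circle = [(math.cos(2 * math.pi * t / 16), math.sin(2 * math.pi * t / 16))
          for t in range(16)]
mags = dft_magnitudes(radial_signature(circle, (0.0, 0.0)))
```

Shape irregularities show up as energy at higher harmonics, which is the kind of multidimensional shape descriptor the article advocates over scalar metrics like area or volume.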
Ouso, D.; Pollastri, G.
Deep learning (DL) has advanced computational genome annotation tasks such as protein sub-cellular localisation (SCL) prediction. Nonetheless, its potential remains underutilised, primarily because of the limited availability of high-quality reference data and suboptimal input preparation strategies. In this study, we develop and analyse a high-quality dataset derived from the latest release of the Universal Protein Knowledgebase (UniProtKB), designed to address existing challenges and support robust DL-based SCL modelling. The dataset was constructed through extensive quality preprocessing to ensure reliability, manual label mapping to enhance the quantity and diversity of the training data, and stringent partitioning to minimise data leakage. We validated the dataset using independent test sets, achieving up to 10.8% performance improvement, measured by the area under the precision-recall curve (PR-AUC), compared to the state-of-the-art (SoTA). Furthermore, we highlighted potential performance metric inflation in existing SoTA predictors by demonstrating, for the first time, at least 4.8% training-to-testing data leakage (pre-sequence representation) when using only 10% of the training set under homology augmentation (augmentation based on sequence similarity database searches; details in Sub-section 2.1), a commonly used data augmentation strategy in DL-based SCL prediction modelling. SCL2205 will efficiently support the development of robust, trustworthy, and generalisable DL-based SCL predictors, while minimising data leakage and promoting reproducibility. It is openly available under the Creative Commons Zero (CC0 1.0) licence on DRYAD and is conveniently deployed as a package on the Python Package Index as p-scldata.
Vliora, A.; Tiberti, M.; Papaleo, E.
MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found high overall agreement with a mean Pearson correlation of 0.933 and a mean Cohen coefficient of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. The number of disagreements was higher at sites with low AlphaFold2 confidence for NUPR1 and TSC1. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
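The version-agreement check above boils down to correlating per-variant folding free energy predictions between FoldX releases. A self-contained Pearson correlation sketch; the paired values below are invented stand-ins, not MAVISp data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two paired lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

v5 = [0.1, 1.2, -0.4, 2.5, 0.8]     # hypothetical version-A predictions
v51 = [0.2, 1.1, -0.3, 2.4, 0.9]    # hypothetical version-B predictions
r = pearson(v5, v51)
```

In the study this continuous agreement is complemented by a categorical one (Cohen's coefficient on stability classes), since a database cares about both the raw values and the resulting classifications.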
Zia, M. K.; Plessinger, B.; Eng, K. H.; Flierl, A.; Wilbert, M.; Jans, K.; Whalen, P.; Mullin, S.; Ohm, J.; Singh, A. K.; Farrugia, M.; Morrison, C.; Darlak, C. J.; Seshadri, M.
Show abstract
The lack of interoperability among clinical and research data systems poses a significant barrier to cancer researchers interested in evaluating novel mechanistic hypotheses or translating innovative treatment strategies from the laboratory to the clinic. To address this gap in knowledge, we developed an innovative, web-based, data discovery, visualization and analysis tool (nSight) that allows researchers to quickly and easily query clinical/research data and construct de-identified cancer cohorts. Guiding principles for development of the tool were focused on ease of use, intuitiveness, self-service, and presentation of structured but de-identified data to the end user. nSight provides users with information on patient demographics, disease histology, diagnostic procedures and therapeutic interventions, timeline of disease progression/recurrence, along with available molecular profiling/sequencing data and indicators of participation in epidemiologic or lifestyle studies for specific cancer patient cohorts. The platform also allows users to obtain summary statistics based on demographic, histologic and clinical factors as well as perform basic survival analysis using Kaplan-Meier curves between specific patient cohorts. nSight is an intuitive, user-friendly tool that enables visualization, integration and analysis of multimodal clinical and research data without placing high technical demands or time constraints on researchers. The platform is designed for research feasibility assessment, cohort development, and retrospective data discovery, which in turn should help investigators identify potential study populations and explore novel hypotheses.
Fry Brumit, D.; Sorgen, A. A.; Fodor, A.
Show abstract
Background: Beta diversity quantifies pairwise differences between two or more communities through matrix transformations, which are either naive to phylogeny or phylogenetically aware. Methods have recently been introduced that also consider compositionality and sparsity and that display an increased magnitude of pseudo-F scores as produced by PERMANOVA to measure effect size. In this study, we ask how transformations that consider phylogeny, sparsity, and compositionality compare to older, simpler methods across five publicly available datasets. Results: Application of random forest methods to 107 features across 5 datasets did not yield a consistent increase in classification performance between different beta diversity methods. Limiting datasets to just three eigenvalue decomposition (EVD) axes leads to a small but reliably detectable decrease in performance compared to giving random forest models access to log-normalized or even un-normalized raw count tables. Increasing the number of included EVD axes in classification improves performance across all available models up to ~10-20 axes. We observed larger variation in PERMANOVA pseudo-F scores for some features associated with phylogenetically and compositionally aware beta diversity algorithms across multiple datasets, but did not find that these improved scores yielded consistently increased resolution or accuracy for machine learning methods. Conclusions: While EVD remains an essential technique for dimension reduction, retaining higher-dimensional structures past 3 EVD axes may improve performance. Elevated but insignificant pseudo-F scores may be explained by the higher variance in pseudo-F scores for phylogenetically or compositionally aware methods compared to simpler methods. This indicates that pseudo-F scores are an unreliable overall metric of algorithm performance.
Taken together, our results show that choice of beta diversity metric does not yield a substantial difference in effect size or machine learning performance. We conclude that analysts are free to choose appropriate methods for each dataset, balancing simplicity against corrections for phylogeny, sparsity, and compositionality, and that these choices are unlikely to impact the overall power and resolution of biological conclusions drawn from microbial data.
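As an example of the "older, simpler" phylogeny-naive transformations this study compares against, Bray-Curtis dissimilarity can be computed directly from a count table; the three sample vectors below are invented for illustration:

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors:
    1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b))."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - 2.0 * shared / (sum(a) + sum(b))

def dissimilarity_matrix(samples):
    """Pairwise Bray-Curtis matrix, the kind of transformation fed to
    PERMANOVA or to eigenvalue decomposition (PCoA) before classification."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]

counts = [
    [10, 0, 5],   # sample A
    [8, 2, 5],    # sample B
    [0, 20, 0],   # sample C
]
m = dissimilarity_matrix(counts)
print(round(m[0][1], 3), round(m[0][2], 3))  # A~B similar, A~C disjoint
```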
Brate, J.; Grande, E. G.; Pedersen, B. N.; Frengen, T. G.; Stene-Johansen, K.
Show abstract
Here we evaluated the performance of a previously published tiling PCR primer scheme by Ringlander et al. (2022) for whole-genome amplification of Hepatitis B virus (HBV) in combination with Oxford Nanopore sequencing. The primer set originally developed for Ion Torrent sequencing was adapted by removing platform-specific adapters and tested using clinical serum or plasma samples submitted for routine HBV genotyping and resistance testing. Two multiplexing strategies were compared: a single PCR pool containing all primers and a two-pool strategy with non-overlapping amplicons. Sequencing reads were processed using a Nanopore analysis pipeline, and genome coverage and amplicon performance were compared across samples spanning a wide Ct range and representing HBV genotypes A-E. Across all samples, the median genome coverage was approximately 50%, although recovery varied widely, ranging from complete failure to nearly full genomes. Combining all primers into a single PCR reaction, or separating overlapping amplicons into different reactions, had little overall impact on genome recovery, and no consistent differences between the two pooling strategies were observed. In contrast, amplification efficiency differed markedly between individual amplicons. Amplicons 1-5 generally produced higher sequencing depth, whereas amplicons 6-10 frequently showed low coverage and contributed to incomplete genome recovery. Genome coverage was strongly associated with Ct values, with higher coverage observed in samples with lower Ct values, while coverage was broadly similar across genotypes. These results demonstrate that the Ringlander et al. primer scheme can be adapted for multiplex PCR and Nanopore sequencing of HBV, but uneven amplicon performance limits consistent full-genome recovery and highlights the need for further optimization of HBV tiling PCR designs.
Zhang, X.
Show abstract
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates. 
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
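The within-system stability reported above is a Jaccard index between the accession sets returned by two runs. A minimal sketch with hypothetical accessions (not the study's actual UniProt outputs):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of accessions."""
    if not a and not b:
        return 1.0  # two empty retrievals are trivially identical
    return len(a & b) / len(a | b)

run1 = {"P12345", "Q67890", "A0A111", "B2B222"}
run2 = {"P12345", "Q67890", "A0A111", "C3C333"}
print(jaccard(run1, run2))  # 3 shared / 5 total = 0.6
```

A mean of per-category Jaccard values gives the "mean category Jaccard" figure, while pooling all accessions before comparison gives the micro-Jaccard.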
Xu, S.; Wang, Z.; Wang, H.; Ding, Z.; Zou, Y.; Cao, Y.
Show abstract
Online cancer peer-support communities generate large volumes of patient-authored and caregiver-authored text that may reflect distress, coping, and informational needs. Automated emotional tone classification could support scalable monitoring, but supervised modeling depends on label quality and may benefit from explicit context features. Using the Mental Health Insights: Vulnerable Cancer Survivors & Caregivers dataset, we compared five model families (TF-IDF Logistic Regression, Random Forest, LightGBM, GRU, and fine-tuned ALBERT) on a three-class target (Negative/Neutral/Positive) derived from four original categories. We introduced two extensions: (i) LLM-based annotation to generate parallel "AI labels" and (ii) token-based augmentation that prepends LLM-extracted structured variables (reporter role and cancer type) to the post text. Models were trained with a 60/20/20 stratified train/validation/test split, with hyperparameters selected on validation data only. Test performance was summarized using weighted F1 and macro one-vs-rest AUC with bootstrap confidence intervals, with paired comparisons based on McNemar tests and false discovery rate adjustment. The LLM annotator produced substantial redistribution in the four-class label space, shifting prevalence toward very negative relative to the original labels; the shift persisted but attenuated after collapsing to three classes. Across all model families, token augmentation improved held-out performance, with the largest gains for GRU and consistent improvements for ALBERT. Augmentation also reduced polarity-reversing errors (Negative ↔ Positive) for ALBERT, while adjacent errors (Negative ↔ Neutral) remained the dominant residual failure mode.
These results indicate that LLM-based supervision can introduce systematic measurement shifts that require auditing, yet LLM-extracted context incorporated via simple token augmentation provides a pragmatic, model-agnostic mechanism to improve downstream emotional tone classification for supportive oncology decision support. Author summary: We studied how to better monitor emotional tone in posts from online cancer peer-support communities, where patients and caregivers share experiences that may signal distress, coping, or unmet needs. Automated classification could help organizations and moderators identify when additional support may be needed, but these systems depend on the quality of the labels used for training and may miss clinical context. Using a public dataset of cancer survivor and caregiver posts, we trained and compared several machine-learning and deep-learning models to classify each post as negative, neutral, or positive. We tested two practical improvements. First, we used a large language model to generate an additional set of "AI labels" and examined how these differed from the original categories. Second, we extracted simple context information (whether the writer was a patient or caregiver and what cancer type was mentioned) and added this context to the text before model training. We found that adding context consistently improved performance across model types. However, the AI-generated labels shifted class distributions, indicating that automated labeling can introduce systematic changes that should be audited. Overall, simple context extraction can make emotional tone monitoring more accurate and useful for supportive oncology decision support.
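The token-based augmentation described in this abstract amounts to prepending extracted context variables as plain tokens ahead of the post text, so any text model can condition on them. A sketch with an invented tag format (the authors' exact token scheme is not specified):

```python
def augment_with_context(text: str, role: str = None, cancer_type: str = None) -> str:
    """Prepend structured context variables as plain tokens; works equally
    for TF-IDF pipelines and transformer tokenizers."""
    tokens = []
    if role:
        tokens.append(f"[ROLE={role.upper()}]")
    if cancer_type:
        tokens.append(f"[CANCER={cancer_type.upper()}]")
    return " ".join(tokens + [text])

post = "The scans came back and we are waiting to hear from the oncologist."
print(augment_with_context(post, role="caregiver", cancer_type="lung"))
```

Because augmentation happens at the string level, it is model-agnostic: the same augmented text can be fed to every model family being compared.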
Chanraeng, N.; Guo, J.; Srisongkram, T.; Hinwan, Y.; Fransson, P.; Sjödin, H.; Matsuura, Y.; Overgaard, H. J.; Panthong, W.; Ekalaksananan, T.; Pientong, C.; Phanthanawiboon, S.
Show abstract
Assessing the human infection potential of emerging coronaviruses remains a critical challenge for global health preparedness. In this study, we developed a machine learning-based framework to predict the human infection potential of coronaviruses and to identify associated sequence motifs using spike (S) protein sequences. A total of 3,904 complete S protein sequences were collected, annotated as human or non-human infection, and encoded using trimer-based k-mer features. Model benchmarking was conducted across 27 machine learning algorithms, followed by hyperparameter optimization of the selected model. Robustness and generalizability were evaluated using k-fold cross-validation and independent external validation. Feature interpretability was further assessed using SHAP analysis to identify sequence determinants associated with infection potential. The Random Forest classifier achieved the best performance, with accuracy, sensitivity, and specificity of 97.8%, 99%, and 97.4%, respectively, and demonstrated stable predictive performance across validation datasets. Notably, the KIQ and LEP motifs were strongly associated with human-infecting coronaviruses and mapped to the HR1 and N-terminal domain regions of the S protein. Overall, this framework provides a practical approach for risk assessment and surveillance of emerging coronaviruses. Author summary: Emerging coronaviruses continue to threaten global public health, but rapidly identifying viruses with the potential to infect humans remains challenging. Traditional experimental approaches are time-consuming and resource-intensive, limiting their use for large-scale surveillance. In this study, we developed a machine learning-based workflow to assess the human infection potential of coronaviruses using spike protein sequences. By analyzing sequence patterns across a diverse set of coronaviruses, our framework enables rapid screening of coronaviruses from multiple host species.
Unlike previous studies focused on limited coronavirus genera, our approach integrates all four genera and systematically evaluates multiple learning strategies. Importantly, our analysis identifies conserved sequence motifs linked to human infection potential, bridging predictive performance with biological interpretability. Our findings demonstrate that computational approaches can support early warning systems for identifying high-risk coronaviruses, helping to prioritize viruses for experimental validation, guide surveillance efforts, and strengthen global pandemic preparedness under a One Health perspective.
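The trimer-based k-mer encoding used in this framework can be sketched as counting overlapping 3-mers along a protein sequence; the spike fragment below is a toy string containing the KIQ motif, not a real S protein:

```python
from collections import Counter

def trimer_features(seq: str) -> Counter:
    """Count overlapping 3-mers (trimers) in a protein sequence; the counts
    form the feature vector fed to a classifier."""
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))

spike_fragment = "KIQDSLSSTASALGKIQ"
feats = trimer_features(spike_fragment)
print(feats["KIQ"])  # the KIQ motif appears twice in this toy fragment
```

A sequence of length L yields L - 2 overlapping trimers, and motif-level importances (e.g. for KIQ or LEP) can then be read off a trained model with tools such as SHAP.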